In the process of data exploration we often generate many new data objects. Just as with figures, we may also save data objects for later use.
The following call saves an “image” of all data obejcts in a compressed file format (conventional: *.Rdat)
save.image(file="04.Rdat")
Task: What happens if the “file=” parameter is omitted? Saving all data can be an overkill. Find out how to save only selected or single objects, e.g. S.
Rdat files are most convenient for future processing with R. Sometimes we need to save objects in (tab-separated) text files:
write.table(iris, file="iris.tsv", sep="\t", row.names=FALSE, quote=FALSE)
In RStudio, explore the “Environment” tab. On the console there are also many convenient commands to track (and delte) objects from memory.
ls() # list all obejct
rm(x) # remove object x
# rm(list=ls()) # Careful: This removes all objects
Reading external data into R is easy, but it is crucial to have a very clean and well-defined data structure.
# load("04.Rdat") # easiest and fastest for previously generated Rdat files
iris_f=read.table(file="../data/iris.tsv", sep="\t", header=TRUE)
str(iris_f)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris_f)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
In reality, many external data is badly formatted, leading to much wasted time - before any analysis.
file="https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/GeneList.tsv" # locate at github https://github.com/maxplanck-ie/Rintro
read.table(file)
## Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 2 did not have 6 elements
read.table(file, header=TRUE) # watch out for headers
## Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 6 elements
read.table(file, header=TRUE, comment.char = "%") # comment lines
## Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 6 elements
read.table(file, header=TRUE, comment.char = "%", sep="\t") # separators
## GeneName Description Exp1 Exp2 Exp3 Planned
## 1 DEF1 1.3 4.7 NA NA
## 2 5XYZ\tpseudogene\t50\t20\t12\t\n3UVW 10.0 20.0 34 NA
## 3 Sep7 <NA> 20.0 NA 22 NA
## 4 Oct1 important 1.0 1.0 1 NA
read.table(file, header=TRUE, comment.char = "%", sep="\t", quote="\"") # quote character. What is the default?
## GeneName Description Exp1 Exp2 Exp3 Planned
## 1 ABC unknown 1.2 1.5 1.3 NA
## 2 DEF1 1.3 4.7 NA NA
## 3 5'XYZ pseudogene 50.0 20.0 12.0 NA
## 4 3'UVW 10.0 20.0 34.0 NA
## 5 Sep7 <NA> 20.0 NA 22.0 NA
## 6 Oct1 important 1.0 1.0 1.0 NA
# Now fill the data frame
D=read.table(file, header=TRUE, comment.char = "%", sep="\t", quote="\"")
is.na(D) # any NA's?
## GeneName Description Exp1 Exp2 Exp3 Planned
## [1,] FALSE FALSE FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE TRUE TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE FALSE FALSE TRUE
## [5,] FALSE TRUE FALSE TRUE FALSE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE TRUE
colSums(is.na(D))
## GeneName Description Exp1 Exp2 Exp3 Planned
## 0 1 0 1 1 6
i=which(rowSums(is.na(D[,3:5]))==0) # keep rows without NA
j=which(colSums(is.na(D))==nrow(D)) # remove columns with all NA
D[i,-j]
## GeneName Description Exp1 Exp2 Exp3
## 1 ABC unknown 1.2 1.5 1.3
## 3 5'XYZ pseudogene 50.0 20.0 12.0
## 4 3'UVW 10.0 20.0 34.0
## 6 Oct1 important 1.0 1.0 1.0
Notice1: The following code fragments took ~2 hours of trial and error. It is kept only for reference. No need to execute
Notice2: R can read from http. For slow network connections, it will be faster to first download the file to local storage and work from there.
# consider a data file from one of the first large scale ChIP-studies (Rick Young lab: yeast)
# 1. Find their homepage (google) and go to the "Data Download"" section.
# 2. find the Publication by Lee et al. (2002) Transcriptional Regulatory Networks in Saccharomyces cerevisea.
# 3. In the "Download Raw Data" section copy the second link ("binding of regulators to genes (text file)"
# this file contains binding data (p-values and ratios) for more than 6000 yeast genes and 113 yeast transcription factors
chip="http://jura.wi.mit.edu/young_public/regulatory_network/binding_by_gene.tsv"
# first problem: two headers --> skip first --> skip line 1
D=read.table(file=chip, header=TRUE, skip=1, sep="\t")
nrow(D) # 511 << 6000 !!!???
# second problem: quote symbols in names (')
cf=count.fields(chip,sep="\t") # many NA --> quoted fields got missing
D=read.table(chip, header=TRUE, sep="\t", quote="\"", skip=1)
rownames(D)=D[,1] # ORF-symbols in column 1, keep them as rownames
ic=c(1:4,seq(6,230,by=2)) # define column indices to be excluded (only keep p-values)
P=D[,-ic] # new data-frame of p-values, exclude superfluous columns and ratios
write.table(P,file="../data/chip.tsv", sep="\t", quote=FALSE) # write p-values to file
Let’s start with a cleaned-up data set. Now available at “https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/chip.tsv” (locate this raw file on https://github.com/maxplanck-ie/Rintro/)
fn="https://raw.githubusercontent.com/maxplanck-ie/Rintro/master/data/chip.tsv" # remote filename
P=read.table(file=fn, header=TRUE, sep="\t")
Describe data and sanity checks
str(P) # describe the data structure
which(is.na(P)) # make sure everything is properly defined
which(P<0 || P>1) # make sure p-values are all between 0 and 1
Now we can start looking at something more interesting
# defined boolean matrix of TF-gene interactions (based on p-value threshold)
pt=1e-3
B=P<pt # result is matrix.
# reduce problem size, heatmap slow
i=rowSums(B)>1 # only genes with sufficient number of TF-binding
j=colSums(B)>40 # only TF with sufficient number of targets
# heatmap needs numeric matrix ==> convert logical to integer
B=1*B # equivalent tricks: B[B]=1; mode(B) = "integer", but _not_ as.integer(B) # strips attributes, i.e. dimensionsas
heatmap(B[i,j], scale="none", col=c("white","black"), labRow=FALSE)
#d3heatmap(B[i,j], scale="none", col=c("white","black"), distfun = function(c) dist(c, method="binary"))
Explore relation between FHL1 and RAP1
smoothScatter(log(P[,"FHL1"]),log(P[,"RAP1"]),main="smoothScatter of p-values") # scatterplot of log(p-values) for two TF
abline(h=log(pt), v=log(pt), col="red") # "arbitrary" thresholds
cor(B[,"FHL1"],B[,"RAP1"]) # Phi-coefficient = Pearson CC for binary variables
## [1] 0.5819476
Contigency tables
table(B[,"FHL1"]) # 194 genes bound by FHL1
##
## 0 1
## 6076 194
tb=table(B[,"FHL1"],B[,"RAP1"]) # contingency table: 119 genes bound by FHL1 and RAP1
tb
##
## 0 1
## 0 5989 87
## 1 75 119
fisher.test(tb) # the overlap is highly unexpected
##
## Fisher's Exact Test for Count Data
##
## data: tb
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 75.19957 157.98637
## sample estimates:
## odds ratio
## 108.6888
chisq.test(tb)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tb
## X-squared = 2104.6, df = 1, p-value < 2.2e-16
Generalize: Find other groups of TF with common targets
C=cor(B[,j]) # correlation matrix. Alternative: C=cor(log(P+eps))
heatmap(C,col = rev(grey.colors(100)) )
Heatmaps can summarize overall data, but are often difficult to explore. Additional tools are needed. Enter the world of R packages.
Explore CRAN “https://cran.r-project.org” orGoogle “interactive heatmaps in R”
#install.packages("heatmaply") # only need to install once (but takes some time: ~10min)
library(heatmaply) # only need once per session. notice messages and warnings
## Warning: package 'heatmaply' was built under R version 3.2.5
## Loading required package: plotly
## Warning: package 'plotly' was built under R version 3.2.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Warning: package 'viridis' was built under R version 3.2.5
## Loading required package: viridisLite
## Warning: package 'viridisLite' was built under R version 3.2.5
## Warning: replacing previous import by 'magrittr::%>%' when loading
## 'dendextend'
## Warning: replacing previous import by 'magrittr::%>%' when loading
## 'heatmaply'
## Warning: replacing previous import by 'dendextend::as.ggdend' when loading
## 'heatmaply'
## Warning: replacing previous import by 'dendextend::color_branches' when
## loading 'heatmaply'
## Warning: replacing previous import by 'dendextend::find_k' when loading
## 'heatmaply'
## Warning: replacing previous import by 'dendextend::is.dendrogram' when
## loading 'heatmaply'
## Warning: replacing previous import by 'dendextend::is.hclust' when loading
## 'heatmaply'
## Warning: replacing previous import by 'dendextend::rotate' when loading
## 'heatmaply'
## Warning: replacing previous import by 'dendextend::seriate_dendrogram' when
## loading 'heatmaply'
## Warning: replacing previous import by 'dendextend::set' when loading
## 'heatmaply'
##
## ---------------------
## Welcome to heatmaply version 0.10.1
## Type ?heatmaply for the main documentation.
## The github page is: https://github.com/talgalili/heatmaply/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
##
## To suppress this message use: suppressPackageStartupMessages(library(heatmaply))
## ---------------------
heatmaply(C,col = rev(grey.colors(100)) )
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## Warning: replacing previous import by 'shiny::includeHTML' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::knit_print.shiny.tag' when
## loading 'crosstalk'
## Warning: replacing previous import by 'shiny::code' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::includeScript' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::includeMarkdown' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::tags' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::is.singleton' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::withTags' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::img' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::tagAppendAttributes' when
## loading 'crosstalk'
## Warning: replacing previous import by 'shiny::knit_print.shiny.tag.list'
## when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::knit_print.html' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::tagAppendChild' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::includeCSS' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::br' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::singleton' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::span' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::a' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::tagList' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::strong' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::tag' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::p' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::validateCssUnit' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::HTML' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::h1' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::h2' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::h3' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::h4' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::h5' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::h6' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::tagAppendChildren' when
## loading 'crosstalk'
## Warning: replacing previous import by 'shiny::em' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::div' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::pre' when loading 'crosstalk'
## Warning: replacing previous import by 'shiny::htmlTemplate' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::suppressDependencies' when
## loading 'crosstalk'
## Warning: replacing previous import by 'shiny::tagSetChildren' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::includeText' when loading
## 'crosstalk'
## Warning: replacing previous import by 'shiny::hr' when loading 'crosstalk'
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Investigating overlaps:
# Which genes are commonly bound by FHL1 and RAP1?
names(which(B[,"FHL1"]==1 & B[,"RAP1"]==1))
## [1] "YBL087C" "YBR084C-A" "YBR085W" "YBR181C" "YBR188C"
## [6] "YBR189W" "YBR190W" "YBR191W" "YDL075W" "YDL076C"
## [11] "YDL133C-A" "YDL133W" "YDL136W" "YDL184C" "YDL191W"
## [16] "YDR024W" "YDR025W" "YDR447C" "YDR448W" "YDR449C"
## [21] "YDR450W" "YDR470C" "YDR471W" "YDR500C" "YDR501W"
## [26] "YEL054C" "YER074W" "YER101C" "YER102W" "YER116C"
## [31] "YER117W" "YFL034C-A" "YFL034W" "YFR031C-A" "YGL030W"
## [36] "YGL031C" "YGL103W" "YGL104C" "YGL123W" "YGL124C"
## [41] "YGL135W" "YGL136C" "YGL189C" "YGR033C" "YGR034W"
## [46] "YGR117C" "YGR118W" "YGR148C" "YGR149W" "YHL001W"
## [51] "YHR021C" "YHR021W-A" "YHR141C" "YHR142W" "YHR203C"
## [56] "YHR204W" "YIL018W" "YIL133C" "YIL148W" "YIL149C"
## [61] "YJL134W" "YJL135W" "YJL136C" "YJL177W" "YJL178C"
## [66] "YJL189W" "YJL190C" "YJL191W" "YJL192C" "YKL006C-A"
## [71] "YKL006W" "YKL180W" "YLR047C" "YLR048W" "YLR287C-A"
## [76] "YLR325C" "YLR326W" "YLR333C" "YLR344W" "YLR387C"
## [81] "YLR388W" "YLR441C" "YLR447C" "YLR448W" "YML024W"
## [86] "YML025C" "YML026C" "YML063W" "YML064C" "YML073C"
## [91] "YMR142C" "YMR143W" "YMR242C" "YNL069C" "YNL096C"
## [96] "YNL162W" "YNL163C" "YNL302C" "YOL039W" "YOL040C"
## [101] "YOL120C" "YOL127W" "YOL128C" "YOR095C" "YOR096W"
## [106] "YOR234C" "YOR235W" "YOR292C" "YOR293W" "YOR312C"
## [111] "YPL079W" "YPL080C" "YPL143W" "YPL249C-A" "YPR080W"
## [116] "YPR102C" "YPR103W" "YPR131C" "YPR132W"
# create a Venn Diagram of target gene overlaps (with package gplots)
library(gplots)
## Warning: package 'gplots' was built under R version 3.2.4
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
L=list(FHL1=which(B[,"FHL1"]==1),RAP1=which(B[,"RAP1"]==1),MCM1=which(B[,"MCM1"]==1), STE12=which(B[,"STE12"]==1) )
vn=venn(L)
# better than Venn
library(UpSetR)
## Warning: package 'UpSetR' was built under R version 3.2.5
## Warning: replacing previous import by 'scales::alpha' when loading 'UpSetR'
upset( as.data.frame(B), sets = c("FHL1", "RAP1", "MCM1", "STE12"), empty.intersections = TRUE )